Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

1.6 PREPROCESSING OF THE FASTQ READS

Above, we discussed how to assess the quality of the reads produced by HTS instruments
and how warnings or failures of the quality metrics indicate potential errors and biases.
Before moving on to the next step of data analysis, these errors and biases should be
corrected to avoid incorrect results and misleading interpretations. In general, there are
three common approaches to fixing the problems revealed by the quality metrics:
(i) trimming the ends of the reads, (ii) removing low-quality reads, and (iii) masking
low-quality bases. Which approach to use depends on the quality problem. In the following,
we will discuss the most commonly used programs for dealing with read quality issues.
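The three approaches can be illustrated on a single read with a minimal Python sketch. This is not how any particular tool implements them; the function names, the Phred+33 quality encoding, the threshold of 20, and the 80% fraction are all assumptions chosen for illustration.

```python
PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def phred_scores(qual):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]


def trim_3prime(seq, qual, min_q=20):
    """(i) Trimming: drop bases from the 3' end until one reaches min_q."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(qual, min_q=20, min_fraction=0.8):
    """(ii) Filtering: keep a read only if enough of its bases reach min_q."""
    scores = phred_scores(qual)
    if not scores:
        return False
    good = sum(1 for s in scores if s >= min_q)
    return good / len(scores) >= min_fraction


def mask_low_quality(seq, qual, min_q=20, mask_char="N"):
    """(iii) Masking: replace each base below min_q; read length is unchanged."""
    scores = phred_scores(qual)
    return "".join(b if s >= min_q else mask_char for b, s in zip(seq, scores))


# Hypothetical read: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
seq = "ACGTACGTAC"
qual = "IIII#IIII#"

print(trim_3prime(seq, qual))       # only the low-quality 3' base is removed
print(passes_filter(qual))          # 8 of 10 bases reach Q20
print(mask_low_quality(seq, qual))  # both low-quality bases become N
```

Note that trimming only touches the end of the read, so the low-quality base in the middle survives it, whereas masking handles both positions without shortening the read.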

One of the most commonly used software packages for processing raw sequence reads in FASTQ
files is the FASTX-toolkit [14], a collection of command-line programs. The installation
instructions for the FASTX-toolkit are available at
“http://hannonlab.cshl.edu/fastx_toolkit/download.html”. We can download and install it on
Linux using the following steps:

Create a directory to hold the FASTX-toolkit compressed file and change into it:

mkdir fastxtoolkit

cd fastxtoolkit

Download the compressed program file and decompress it:

wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2

tar xvf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2

Copy the program files from the “bin” directory to “/usr/local/bin” so that they can be
executed from any directory on the computer:

sudo cp ./bin/* /usr/local/bin
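After copying the files, it can be useful to confirm that the programs are actually found on the PATH. The following is a small sketch using only the Python standard library; the list of program names is a partial selection from the toolkit, and the helper name find_tools is hypothetical.

```python
import shutil

# A few of the programs that ship with the FASTX-toolkit (partial list).
TOOLS = [
    "fastx_quality_stats",
    "fastq_quality_filter",
    "fastq_quality_trimmer",
    "fastq_masker",
]


def find_tools(names):
    """Map each program name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools(TOOLS).items():
    print(f"{name}: {path if path else 'not found on PATH'}")
```

If any program is reported as not found, check that the copy step succeeded and that “/usr/local/bin” is included in the PATH environment variable.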

FIGURE 1.27 Failed k-mer content.